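❝Missing values are common and are usually classified into three types:
- Missing Completely at Random (MCAR): the probability that a value is missing is unrelated to any observed or unobserved data. This situation is rare in practice.
- Missing at Random (MAR): whether a value is missing depends on other observed variables, not on the missing value itself. For example, patients with cold symptoms such as coughing and dizziness may not have their temperature taken, because fever is assumed or judged by touch. Such missing values can be predicted and imputed from the other recorded data.
- Missing Not at Random (MNAR): whether a value is missing depends on the unobserved value itself or on unrecorded factors. In this case, record the reason for the missingness wherever possible so that the potential bias can be assessed.
❞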
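The three mechanisms can be made concrete by deleting values from a complete data set in three different ways. The following is a minimal base R sketch on iris; the deletion probabilities (0.2, 0.5, 0.05, 0.6) and the 6.5 cut-off are arbitrary choices made for illustration only.

```r
set.seed(42)
n <- nrow(iris)

# MCAR: every Sepal.Length value has the same 20% chance of being deleted.
mcar <- iris
mcar$Sepal.Length[runif(n) < 0.2] <- NA

# MAR: the chance of deletion depends on an observed column (Species).
mar <- iris
p_mar <- ifelse(iris$Species == "setosa", 0.5, 0.05)
mar$Sepal.Length[runif(n) < p_mar] <- NA

# MNAR: the chance of deletion depends on the value that goes missing.
mnar <- iris
p_mnar <- ifelse(iris$Sepal.Length > 6.5, 0.6, 0.05)
mnar$Sepal.Length[runif(n) < p_mnar] <- NA

# Share of missing Sepal.Length values per species under each mechanism.
res <- sapply(list(MCAR = mcar, MAR = mar, MNAR = mnar),
              function(d) tapply(is.na(d$Sepal.Length), iris$Species, mean))
res
```

Under MAR the missingness can be explained by Species, which is still in the data; under MNAR the explanation lies in the deleted values themselves, which is why it is the hardest case to correct.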
Load the R packages
install.packages("simputation", dependencies = TRUE)
library(simputation)
library(tidyverse)
simputation provides many methods for imputing missing values, including:
- Donor imputation (including various donor pool specifications)
- Sequential hot deck (LOCF, NOCB)
- Proxy imputation: copy values from another column, or use values derived from it by a simple transformation.
- Apply trained models for imputation purposes.
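The two sequential hot deck variants named above, LOCF (last observation carried forward) and NOCB (next observation carried backward), can be pictured with a few lines of base R. This is a sketch of the idea only, not simputation's implementation; the helper names locf and nocb are made up for this example.

```r
# Last observation carried forward: each NA takes the most recent
# observed value before it. Leading NAs stay NA.
locf <- function(x) {
  idx <- cumsum(!is.na(x))   # position of the last observed value so far
  obs <- x[!is.na(x)]        # observed values in their original order
  out <- x
  ok <- idx > 0              # FALSE while nothing has been observed yet
  out[ok] <- obs[idx[ok]]
  out
}

# Next observation carried backward: LOCF applied to the reversed vector.
nocb <- function(x) rev(locf(rev(x)))

x <- c(NA, 1, NA, NA, 4, NA)
locf(x)  # NA  1  1  1  4  4
nocb(x)  #  1  1  4  4  4 NA
```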
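Usage
impute_<model>(data, formula, [model-specific options])
Available models include:
- impute_rlm: robust linear model
- impute_en: ridge/elasticnet/lasso
- impute_rhd: random hot deck
- impute_shd: sequential hot deck
- impute_knn: k nearest neighbours
- impute_lm: linear regression
- impute_pmm: predictive mean matching (a hot-deck method)
- impute_proxy: imputation with a user-defined formula, for example the mean
The arguments are:
- data: the data frame to impute.
- formula: the columns to impute on the left-hand side and the predictors on the right-hand side, optionally followed by | and one or more grouping variables.
- [model-specific options]: further arguments that depend on the chosen model.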
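To see how such a formula splits into its parts, it can be taken apart with base R. The snippet below only illustrates the structure of an R formula object; it is not how simputation parses its input.

```r
# A formula of the shape used in this article: target ~ predictors | grouping.
f <- Sepal.Length ~ Petal.Width | Species

lhs <- f[[2]]   # left-hand side: the column(s) to impute
rhs <- f[[3]]   # right-hand side: a call to `|` with two arguments

targets    <- all.vars(lhs)        # "Sepal.Length"
predictors <- all.vars(rhs[[2]])   # "Petal.Width"
grouping   <- all.vars(rhs[[3]])   # "Species"
```

Because | binds more tightly than ~ in R, everything to the right of ~ arrives as a single expression, which is what allows the grouping variables to be written in the same formula.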
Imputation by linear regression
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA
head(dat, 10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |
NA | 3.5 | 1.4 | 0.2 | setosa |
NA | 3.0 | 1.4 | 0.2 | setosa |
NA | NA | 1.3 | 0.2 | setosa |
4.6 | NA | 1.5 | 0.2 | setosa |
5.0 | NA | 1.4 | 0.2 | setosa |
5.4 | NA | 1.7 | 0.4 | setosa |
4.6 | NA | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | NA |
4.4 | 2.9 | 1.4 | 0.2 | NA |
4.9 | 3.1 | 1.5 | 0.1 | NA |
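Before imputing, it helps to count how many values are missing in each column. A quick base R check on the same data:

```r
dat <- iris
dat[1:3, 1] <- NA    # Sepal.Length
dat[3:7, 2] <- NA    # Sepal.Width
dat[8:10, 5] <- NA   # Species

na_per_column <- colSums(is.na(dat))
na_per_column
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
#            3            5            0            0            3

n_complete <- sum(complete.cases(dat))  # rows without any NA: 140
```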
da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)
head(da1, 10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |
5.076579 | 3.5 | 1.4 | 0.2 | setosa |
4.675654 | 3.0 | 1.4 | 0.2 | setosa |
NA | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
4.600000 | NA | 1.4 | 0.3 | setosa |
5.000000 | 3.4 | 1.5 | 0.2 | NA |
4.400000 | 2.9 | 1.4 | 0.2 | NA |
4.900000 | 3.1 | 1.5 | 0.1 | NA |
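The result above can be approximated by hand with lm() and predict(), which also shows why row 3 is still NA: its predictor Sepal.Width is missing too, so the model cannot produce a prediction for it. This is a base R sketch of the idea, not simputation's internal code.

```r
dat <- iris
dat[1:3, 1] <- NA
dat[3:7, 2] <- NA
dat[8:10, 5] <- NA

# Fit on the rows where all model variables are observed
# (lm() drops incomplete rows by default).
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = dat)

# Predict only where the target is missing; rows with a missing
# predictor get NA back from predict().
miss <- is.na(dat$Sepal.Length)
manual <- dat
manual$Sepal.Length[miss] <- predict(fit, newdata = dat[miss, ])

head(manual, 3)
```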
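Imputation with the median
- Row 3 is still NA after the linear regression step because its predictor Sepal.Width is also missing; the median of Sepal.Length within each Species fills it in.
da2 <- impute_median(da1, Sepal.Length ~ Species)
head(da2, 10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |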
5.076579 | 3.5 | 1.4 | 0.2 | setosa |
4.675654 | 3.0 | 1.4 | 0.2 | setosa |
5.000000 | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
4.600000 | NA | 1.4 | 0.3 | setosa |
5.000000 | 3.4 | 1.5 | 0.2 | NA |
4.400000 | 2.9 | 1.4 | 0.2 | NA |
4.900000 | 3.1 | 1.5 | 0.1 | NA |
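Group-wise median imputation is simple enough to write out in base R. The sketch below uses a simplified data set in which only Sepal.Length has missing values, so its numbers are not meant to reproduce the table above exactly.

```r
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

# Median of the observed values within each species.
med <- tapply(dat$Sepal.Length, dat$Species, median, na.rm = TRUE)

# Look up the median of the species each missing row belongs to.
miss <- is.na(dat$Sepal.Length)
dat$Sepal.Length[miss] <- unname(med[as.character(dat$Species[miss])])

head(dat, 3)
```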
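Imputation with a decision tree (CART)
- The dot (.) on the right-hand side stands for all variables other than Species.
da3 <- impute_cart(da2, Species ~ .)
head(da3, 10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |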
5.076579 | 3.5 | 1.4 | 0.2 | setosa |
4.675654 | 3.0 | 1.4 | 0.2 | setosa |
5.000000 | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
4.600000 | NA | 1.4 | 0.3 | setosa |
5.000000 | 3.4 | 1.5 | 0.2 | setosa |
4.400000 | 2.9 | 1.4 | 0.2 | setosa |
4.900000 | 3.1 | 1.5 | 0.1 | setosa |
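Chaining several imputation methods
- Apply linear regression, median and decision-tree imputation in sequence. This is a pipeline of single imputations, not multiple imputation in the statistical sense.
da4 <- dat %>%
  impute_lm(Sepal.Length ~ Sepal.Width + Species) %>%
  impute_median(Sepal.Length ~ Species) %>%
  impute_cart(Species ~ .)
head(da4, 10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |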
5.076579 | 3.5 | 1.4 | 0.2 | setosa |
4.675654 | 3.0 | 1.4 | 0.2 | setosa |
5.000000 | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
4.600000 | NA | 1.4 | 0.3 | setosa |
5.000000 | 3.4 | 1.5 | 0.2 | setosa |
4.400000 | 2.9 | 1.4 | 0.2 | setosa |
4.900000 | 3.1 | 1.5 | 0.1 | setosa |
Imputation with a constant value
da4 <- impute_const(dat, Sepal.Length ~ 7)
head(da4, 10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |
7.0 | 3.5 | 1.4 | 0.2 | setosa |
7.0 | 3.0 | 1.4 | 0.2 | setosa |
7.0 | NA | 1.3 | 0.2 | setosa |
4.6 | NA | 1.5 | 0.2 | setosa |
5.0 | NA | 1.4 | 0.2 | setosa |
5.4 | NA | 1.7 | 0.4 | setosa |
4.6 | NA | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | NA |
4.4 | 2.9 | 1.4 | 0.2 | NA |
4.9 | 3.1 | 1.5 | 0.1 | NA |
Imputation by copying another variable
- Copy the values of Sepal.Width to impute Sepal.Length.
da4 <- impute_proxy(dat, Sepal.Length ~ Sepal.Width)
head(da4, 10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |
3.5 | 3.5 | 1.4 | 0.2 | setosa |
3.0 | 3.0 | 1.4 | 0.2 | setosa |
NA | NA | 1.3 | 0.2 | setosa |
4.6 | NA | 1.5 | 0.2 | setosa |
5.0 | NA | 1.4 | 0.2 | setosa |
5.4 | NA | 1.7 | 0.4 | setosa |
4.6 | NA | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | NA |
4.4 | 2.9 | 1.4 | 0.2 | NA |
4.9 | 3.1 | 1.5 | 0.1 | NA |
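Imputation by robust linear regression
- Listing two variables on the left-hand side imputes both of them with the same predictors.
da5 <- impute_rlm(dat, Sepal.Length + Sepal.Width ~ Petal.Length + Species)
head(da5)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |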
4.945416 | 3.500000 | 1.4 | 0.2 | setosa |
4.945416 | 3.000000 | 1.4 | 0.2 | setosa |
4.854057 | 3.378979 | 1.3 | 0.2 | setosa |
4.600000 | 3.440107 | 1.5 | 0.2 | setosa |
5.000000 | 3.409543 | 1.4 | 0.2 | setosa |
5.400000 | 3.501236 | 1.7 | 0.4 | setosa |
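Imputation with the group mean plus a random residual
- With Species as the only predictor and no intercept, the linear model predicts the mean of each species. add_residual = "normal" then adds a residual drawn from a normal distribution to each imputed value, so the results change between runs unless a seed is set. The left-hand side . - Species means all variables except Species.
da6 <- impute_lm(dat, . - Species ~ 0 + Species, add_residual = "normal") # Species defines the groups
head(da6)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |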
4.345936 | 3.500000 | 1.4 | 0.2 | setosa |
4.499524 | 3.000000 | 1.4 | 0.2 | setosa |
5.647977 | 3.202251 | 1.3 | 0.2 | setosa |
4.600000 | 3.611873 | 1.5 | 0.2 | setosa |
5.000000 | 2.935475 | 1.4 | 0.2 | setosa |
5.400000 | 3.336509 | 1.7 | 0.4 | setosa |
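The idea of "group mean plus random residual" can be sketched in base R as follows. It assumes the noise is drawn from a normal distribution with mean 0 and the standard deviation of the model residuals; simputation's exact variance estimate may differ, so treat this as an illustration of the principle only. The data set is simplified to a single incomplete column.

```r
set.seed(1)
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

# A no-intercept model on a factor estimates one mean per group.
fit <- lm(Sepal.Length ~ 0 + Species, data = dat)

miss <- is.na(dat$Sepal.Length)
mu <- predict(fit, newdata = dat[miss, ])   # group mean for each missing row

# Assumption: normal noise with the spread of the observed residuals.
noise <- rnorm(sum(miss), mean = 0, sd = sd(residuals(fit)))
dat$Sepal.Length[miss] <- mu + noise

head(dat, 3)
```

Adding noise keeps the imputed values from all sitting exactly on the group mean, which would otherwise shrink the variance of the imputed column.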
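Grouped imputation
- A grouping variable after | fits a separate model within each group.
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA
da8 <- impute_lm(dat, Sepal.Length ~ Petal.Width | Species)
head(da8)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |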
4.968092 | 3.5 | 1.4 | 0.2 | setosa |
4.968092 | 3.0 | 1.4 | 0.2 | setosa |
4.968092 | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
The same grouping can also be expressed with dplyr's group_by():
dat %>%
  group_by(Species) %>%
  impute_lm(Sepal.Length ~ Petal.Width)
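What grouped imputation amounts to can be spelled out by fitting one model per species by hand. The loop below is a base R sketch of that idea under the assumption that each group gets its own independent fit; it is not simputation's internal code.

```r
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

result <- dat
for (sp in levels(dat$Species)) {
  in_group <- dat$Species == sp
  target <- in_group & is.na(dat$Sepal.Length)
  if (any(target)) {
    # Fit on this species only, then predict its missing rows.
    fit <- lm(Sepal.Length ~ Petal.Width, data = dat[in_group, ])
    result$Sepal.Length[target] <- predict(fit, newdata = dat[target, ])
  }
}

head(result, 3)
```

Rows 1 to 3 all have Petal.Width equal to 0.2, so they receive the same predicted value, which matches the repeated value in the table above.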
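Custom imputation methods with impute_proxy
- Define a robust ratio imputation: within each species, a missing Sepal.Length is replaced by Sepal.Width multiplied by the ratio of the median Sepal.Length to the median Sepal.Width.
dat <- impute_proxy(dat, Sepal.Length ~ median(Sepal.Length, na.rm = TRUE) / median(Sepal.Width, na.rm = TRUE) * Sepal.Width | Species)
head(dat)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |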
5.147059 | 3.5 | 1.4 | 0.2 | setosa |
4.411765 | 3.0 | 1.4 | 0.2 | setosa |
NA | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
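The arithmetic behind the ratio imputation can be checked in base R. The helper group_median is a name made up for this sketch.

```r
dat <- iris
dat[1:3, "Sepal.Length"] <- NA
dat[3:7, "Sepal.Width"] <- NA

# Per-row median of the row's own species, ignoring missing values.
group_median <- function(x, g) {
  ave(x, g, FUN = function(v) median(v, na.rm = TRUE))
}

ratio <- group_median(dat$Sepal.Length, dat$Species) /
  group_median(dat$Sepal.Width, dat$Species)

miss <- is.na(dat$Sepal.Length)
dat$Sepal.Length[miss] <- ratio[miss] * dat$Sepal.Width[miss]

head(dat, 3)
```

For setosa the two medians are 5.0 and 3.4, so row 1 becomes 3.5 * 5.0 / 3.4, about 5.147059, as in the table above. Row 3 stays NA because its Sepal.Width is missing as well.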
Imputation with the mean
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA
dat <- impute_proxy(dat, Sepal.Length ~ mean(Sepal.Length, na.rm = TRUE))
head(dat)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |
5.862585 | 3.5 | 1.4 | 0.2 | setosa |
5.862585 | 3.0 | 1.4 | 0.2 | setosa |
5.862585 | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
Imputation with the group mean
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA
dat <- impute_proxy(dat, Sepal.Length ~ mean(Sepal.Length, na.rm = TRUE) | Species)
head(dat)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |
5.012766 | 3.5 | 1.4 | 0.2 | setosa |
5.012766 | 3.0 | 1.4 | 0.2 | setosa |
5.012766 | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
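The difference between the overall mean and the group mean is easy to verify in base R:

```r
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

# One value for the whole column.
overall_mean <- mean(dat$Sepal.Length, na.rm = TRUE)

# One value per row, taken from the row's own species.
group_mean <- ave(dat$Sepal.Length, dat$Species,
                  FUN = function(v) mean(v, na.rm = TRUE))

round(overall_mean, 6)   # 5.862585
round(group_mean[1], 6)  # 5.012766 (setosa)
```

The overall mean is pulled upwards by the two larger species, so for a setosa flower the group mean is the more plausible replacement.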
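Imputing with a model trained on another data set
- The model can be trained in many ways and with a method of your choice, such as logistic regression, linear regression or a decision tree, possibly tuned by ten-fold cross-validation. The simplest option, linear regression, is used here.
m <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA
head(dat)
dat <- impute(dat, Sepal.Length ~ m)
head(dat)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
--- | --- | --- | --- | --- |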
5.063856 | 3.5 | 1.4 | 0.2 | setosa |
4.662076 | 3.0 | 1.4 | 0.2 | setosa |
NA | NA | 1.3 | 0.2 | setosa |
4.600000 | NA | 1.5 | 0.2 | setosa |
5.000000 | NA | 1.4 | 0.2 | setosa |
5.400000 | NA | 1.7 | 0.4 | setosa |
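This is the same approach as the linear regression imputation earlier. The imputed values differ slightly (5.063856 versus 5.076579 in the first row) because m was fitted on the complete iris data rather than on the incomplete dat.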
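The effect of the training data can be checked by hand with predict(). The sketch below compares a model fitted on the complete iris data with one fitted on the incomplete copy; it uses base R only and is not simputation's own code path.

```r
m <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

dat <- iris
dat[1:3, "Sepal.Length"] <- NA
dat[3:7, "Sepal.Width"] <- NA
miss <- is.na(dat$Sepal.Length)

# Predictions from the model trained on the complete data.
from_full <- predict(m, newdata = dat[miss, ])

# Predictions from the same model specification trained on the incomplete data.
m_incomplete <- lm(Sepal.Length ~ Sepal.Width + Species, data = dat)
from_incomplete <- predict(m_incomplete, newdata = dat[miss, ])

cbind(from_full, from_incomplete)
```

Both columns contain NA in the third row, because that row has no Sepal.Width to predict from.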
❝Beyond these, more advanced imputation algorithms are available; see the reference documentation of the VIM package.
❞
Acknowledgements
❝This article is based on the simputation vignettes.
❞